Abstract

In prediction with expert advice the goal is 
to design online prediction algorithms that 
achieve small regret (additional loss on the 
whole data) compared to a reference scheme. 
In the simplest such scheme one compares 
to the loss of the best expert in hindsight. 
A more ambitious goal is to split the data 
into segments and compare to the best ex- 
pert on each segment. This is appropriate if 
the nature of the data changes between seg- 
ments. The standard fixed-share algorithm 
is fast and achieves small regret compared to 
this scheme. 

Fixed share treats the experts as black boxes: 
there are no assumptions about how they 
generate their predictions. But if the experts 
are learning, the following question arises: 
should the experts learn from all data or only 
from data in their own segment? The original 
algorithm naturally addresses the first case. 
Here we consider the second option, which 
is more appropriate exactly when the nature 
of the data changes between segments. In 
general extending fixed share to this second 
case will slow it down by a factor of T on T 
outcomes. We show, however, that no such 
slowdown is necessary if the experts are hid- 
den Markov models. 



1 Introduction 



In prediction with expert advice [Cesa-Bianchi and Lugosi, 2006] a
sequence of outcomes $x_1, x_2, \ldots$ needs to be predicted, one
outcome at a time. Thus, prediction proceeds in rounds: in each round
we first consult a set of experts, who give us their predictions. (We
use the word expert for any source of predictions that is available to
us as input.) Then we make our own prediction and incur some loss based
on the discrepancy between our prediction and the actual outcome.
Predictions may for example be in the form of a probability
distribution on outcomes. Loss may be logarithmic loss, i.e. the
negative logarithm of the probability assigned to the outcome that
actually occurs. The goal is to minimise our regret, which is the
difference between our own cumulative loss on the whole data and the
cumulative loss of a reference scheme, which typically involves tuned
parameter settings unknown to us when we make our predictions. For the
reference scheme there are several options; we may, for example,
compare ourselves to the cumulative loss of the best expert in
hindsight (after observing the data). A more ambitious scheme, called
tracking the best expert, is addressed by the fixed-share algorithm of
Herbster and Warmuth [1998].



1.1 Tracking the Best Expert 

In tracking the best expert (TBE), the goal is to 
achieve small regret compared to the following refer- 
ence scheme: 

(a) Split the data into segments. 

(b) Select an expert for each segment. 

(c) Sum the loss of the selected experts on their seg- 
ments. 

This reference scheme is appropriate if the nature of the data changes
between segments. It is harder than comparing to the single best expert
in hindsight, because now there are more unknowns: both the
segmentation (step (a)) and the reference experts (step (b)) are
unknown when we make our predictions. In particular the reference
experts may be the best experts in hindsight for their assigned
segments.
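As a minimal sketch of steps (a) to (c), assuming the per-round losses of black-box experts are already available in a table, the reference loss of a given segmentation can be computed as follows. The names `is_segmentation`, `tbe_reference_loss` and `best_expert_per_segment` are hypothetical helpers introduced only for this illustration.

```python
def is_segmentation(segments, T):
    """True if the inclusive, 1-based segments (m, n) are disjoint and cover 1..T."""
    covered = sorted(t for (m, n) in segments for t in range(m, n + 1))
    return covered == list(range(1, T + 1))


def tbe_reference_loss(expert_losses, segments, chosen):
    """Step (c): sum the loss of each selected expert on its own segment.

    expert_losses[e][t - 1] is the loss of expert e in round t (assumed given);
    chosen[k] is the expert selected for segments[k] in step (b).
    """
    total = 0.0
    for (m, n), e in zip(segments, chosen):
        total += sum(expert_losses[e][m - 1:n])
    return total


def best_expert_per_segment(expert_losses, segments):
    """Pick, for every segment, the expert with the smallest loss in hindsight."""
    return [
        min(expert_losses, key=lambda e: sum(expert_losses[e][m - 1:n]))
        for (m, n) in segments
    ]


losses = {"a": [1, 1, 5, 5], "b": [4, 4, 2, 2]}
segments = [(1, 2), (3, 4)]              # step (a)
chosen = best_expert_per_segment(losses, segments)  # step (b)
reference = tbe_reference_loss(losses, segments, chosen)
```

In this toy table each single expert suffers loss 12 on the whole data, while the segmented reference suffers only 6, which is why the segmented scheme is the harder benchmark.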

In 1995 Herbster and Warmuth introduced an efficient algorithm called
fixed share (FS) and showed that it achieves small regret (see
Theorem 1 below) compared to the TBE reference scheme [Herbster and
Warmuth, 1998]. Given the predictions of the experts, the algorithm's
running time is linear in the number of outcomes and linear in the
number of experts. Problem solved. Or is it?

1.2 Learning Experts 

In this paper we take another look at the TBE refer- 
ence scheme for learning experts and ask: if an expert 
is selected for some segment, then should the expert 
learn from all data or only from the data in that seg- 
ment? 

We may assume that the experts do not know the segmentation chosen in
step (a) of the reference scheme. (Otherwise, why wouldn't we just ask
them?) Hence if we treat the experts as black boxes and only ask for
their prediction at each time step as in [Herbster and Warmuth, 1998],
it is natural that they learn from all data. We call this the standard
interpretation of the TBE reference scheme (S-TBE).

However, as the following example will illustrate, it 
may be beneficial if experts learn only from the seg- 
ment for which they are selected, because they may get 
confused by data in other segments that follow a different pattern. We
call this the local learners interpretation of tracking the best expert
(LL-TBE). As a slight
complication, it will turn out that in LL-TBE we have 
a further choice: whether to tell a learning expert the 
timing of its segment or not, which generally makes a 
difference. When segment timing is preserved, we call 
the resulting reference scheme sleeping LL-TBE; when 
segment timing is not preserved we call the reference 
scheme freezing LL-TBE. The next example demon- 
strates that S-TBE and the two variants of LL-TBE 
are really different reference schemes. 

Example: Drifting Mean In applications one would usually build up
complicated prediction strategies from simpler ones in a hierarchical
fashion. For example, let us first define simple static experts,
parametrised by $\mu \in \mathbb{R}$, which predict according to a
standard normal distribution with mean $\mu$ in each round. Now define
a learning expert $\mathrm{DM}[\theta]$ that has a stochastic model for
the (unobservable) drift of $\mu$ over time. This drifting mean
learning expert predicts according to a hidden Markov model in which
the hidden state at time $t$ is $\mu_t$ and the production probability
of an outcome given $\mu_t$ is determined by the simple expert with
parameter $\mu_t$. Initially, $\mu_1 = 0$ with probability one. Then
$\mu_{t+1} = \mu_t + 1$ with probability $\theta$ and
$\mu_{t+1} = \mu_t$ with probability $1 - \theta$ for some fixed
parameter $\theta$. (See Figure 1.)

The expert $\mathrm{DM}[\theta]$ may be said to be learning, because
its posterior distribution of $\mu_t$ given outcomes
$x_1, \ldots, x_{t-1}$ indicates how much credibility the expert
assigns to each value of $\mu_t$; high weight on, say, $\mu_t = 3$
indicates that $\mathrm{DM}[\theta]$ considers it likely for
$\mu_t = 3$ to give the best prediction for $x_t$.

Figure 1: State Transitions for Learning Expert $\mathrm{DM}[\theta]$,
which learns a drifting mean

Figure 2: The Difference Between S-TBE and the Two LL-TBE Reference
Schemes. Note the logarithmic scale of the y-axis in (c) and (d)

Figures 2a and 2b plot two artificial data sets. For Figure 2a sleeping
LL-TBE is appropriate, for Figure 2b freezing LL-TBE is more suitable.
The data consist of 10 segments of 100 outcomes. In each segment the
outcomes are increasing deterministically at a rate of either 0.1 or
0.3 per outcome. Note that for the freezing data all segments start
from 0, whereas for the sleeping data any segment looks as if the
process that generated it started at time 1, but went unobserved for a
while.
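The following sketch generates data in the spirit of the two artificial data sets. Several details are assumptions made only for illustration and are not taken from the figures: the rates alternate between 0.1 and 0.3, the first outcome of a process equals 0, and the function name `figure2_style_data` is hypothetical.

```python
def figure2_style_data(rates, segment_length=100, mode="freezing"):
    """Deterministically increasing outcomes, one rate per segment.

    mode="freezing": every segment restarts from 0.
    mode="sleeping": every segment looks as if its process started at
    global time 1 and went unobserved until the segment began.
    """
    if mode not in ("freezing", "sleeping"):
        raise ValueError(mode)
    data = []
    for k, rate in enumerate(rates):
        start = k * segment_length + 1  # global 1-based time of the first outcome
        for t in range(start, start + segment_length):
            elapsed = (t - start) if mode == "freezing" else (t - 1)
            data.append(rate * elapsed)
    return data


rates = [0.1, 0.3] * 5  # assumed alternation, 10 segments in total
freezing = figure2_style_data(rates, mode="freezing")
sleeping = figure2_style_data(rates, mode="sleeping")
```

The two sequences agree on the first segment and differ from the second segment onwards, where the freezing data drop back to 0 while the sleeping data jump to the value the new process would have reached by then.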



Figures 2c and 2d show the cumulative log(arithmic) loss for all three
TBE reference schemes. Note that the difference between the schemes is
so large that their losses had to be plotted on a logarithmic scale. In
each case we consider two experts, $\mathrm{DM}[0.1]$ and
$\mathrm{DM}[0.3]$, and use the expert $\mathrm{DM}[\theta]$ for any
segment with rate $\theta$. The difference between the three schemes
lies in which data is used by $\mathrm{DM}[\theta]$ to learn from. In
the S-TBE scheme $\mathrm{DM}[\theta]$ is shown all the data, even
those outside the segment it has to predict. In the two LL-TBE schemes,
on the other hand, a fresh copy of $\mathrm{DM}[\theta]$ only sees the
data in the segment for which it is selected: for freezing LL-TBE,
$\mathrm{DM}[\theta]$ predicts as if the current segment is the only
data; for sleeping LL-TBE, $\mathrm{DM}[\theta]$ knows the timing of
the segment it is predicting, and treats all samples preceding that
segment as unobserved. Thus in sleeping LL-TBE the original timing of
the segments is preserved, while in freezing LL-TBE it is lost.

We see that for the sleeping data the sleeping LL-TBE reference scheme
has much smaller loss than the other two schemes. And for the freezing
data the freezing LL-TBE scheme has the smallest loss by far. (Mind the
logarithmic scale of the y-axis, which puts the loss of sleeping LL-TBE
deceptively close to the loss of freezing LL-TBE in Figure 2d; a
constant offset indicates a fixed multiplicative overhead.) In both
cases the reason for the large differences between the reference
schemes is that $\mathrm{DM}[\theta]$ gets confused if it learns from
the wrong data.

1.3 Expert Hidden Markov Models 

The learning expert $\mathrm{DM}[\theta]$ in the example above is a
hidden Markov model in which the production probabilities (of outcomes
given the state) depend on lower-level base experts. In general such
prediction strategies are called expert hidden Markov models (EHMMs).
The use of EHMMs is not restricted to describing learning experts. For
example, many algorithms for prediction with expert advice, including
FS itself, can be represented as EHMMs (see Koolen and De Rooij [2008a]
and its references, and Monteleoni and Jaakkola [2003]). In addition,
any ordinary HMM is trivially an EHMM: just introduce lower-level base
experts for its production probabilities. Not every algorithm can be
represented as an EHMM, however. The follow-the-perturbed-leader
algorithm by Hannan [1957] and variable share by Herbster and Warmuth
[1998], for instance, are exceptions.



1.4 Fixed Share for Learning Experts 

LL-TBE Requires More Information The ex- 
ample above shows that there is a large difference 
between S-TBE and the sleeping or freezing LL-TBE 
reference schemes. One may therefore wonder whether 
there exists an algorithm that achieves small regret 
compared to LL-TBE. Unfortunately, no algorithm 
will be able to do the job without additional know- 
ledge about the learning experts. To see this, note 
that the reference scheme may split the data into seg- 
ments in any way it sees fit. But black-box experts 
are not telling us what their predictions would be for 
any possible segmentation; they only give us a single 
prediction each round. Therefore, even if we knew the
segmentation and the selected expert for each segment,
we still would have insufficient information to achieve 
the reference scheme. The only way to address this 
problem is to get more information about the learn- 
ing experts. This information should have an efficient 
representation and should somehow tell us what the 
learning experts would predict for any possible seg- 
mentation. 

Copying Experts is Less Efficient The straight- 
forward approach would be to introduce a fresh copy 
of each expert for each possible start of a new seg- 
ment and run the original fixed-share algorithm on the 
resulting enriched set of experts. But then the num- 
ber of experts would grow linearly with the number of 
rounds, and consequently the total running time would 
go up from linear to quadratic in the number of outcomes. As this makes
the difference between an online algorithm that can run forever and an
algorithm that effectively comes to a stop after, say, $10^5$ outcomes,
it is worth seeing whether such an increase in running
time is really unavoidable. 

EHMMs: the Efficient Special Case As we will show, it turns out there
is a special class of learning experts for which no increase in running
time is
necessary. These are the learning experts that can 
be described in EHMM form. Although this excludes 
learning experts that for example implement follow- 
the-perturbed-leader, the class of EHMMs is still rich 
enough to be of interest, if only because it includes all 
ordinary HMMs. In the interpretation of the two LL- 
TBE reference schemes for learning experts in EHMM 
form, we do need to be careful if the base experts in the 
EHMMs are learning themselves: because we make no 
assumptions about the base experts, they always learn 
from all the data. 

Main Result: Achieving LL-TBE Efficiently We present two new
algorithms: $\mathrm{FS}^{\mathrm{sl}}$ for sleeping LL-TBE and
$\mathrm{FS}^{\mathrm{fr}}$ for freezing LL-TBE, which both generalise
FS. We show that these algorithms achieve the same regret bound
compared to their respective LL-TBE reference schemes as FS achieves
compared to the S-TBE reference scheme. In addition,
$\mathrm{FS}^{\mathrm{sl}}$ runs as fast as the original fixed-share
algorithm; for $\mathrm{FS}^{\mathrm{fr}}$ no slowdown occurs either if
the EHMMs for the learning experts have a finite number of hidden
states, and otherwise it is typically still faster than just copying
the experts.

Like fixed share, our new algorithms can be represented as EHMMs. In
fact, we will build up both algorithms by describing how to combine the
EHMMs for the learning experts, which the algorithms get as inputs,
into a single larger EHMM. Apart from introducing the LL-TBE reference
scheme, this construction is our main result: regret bounds follow from
the EHMM representations using methods described in [Koolen and De
Rooij, 2008a], and the algorithms are simply instances of the forward
algorithm for EHMMs.



1.5 Overview 

We start by formally introducing prediction with expert advice in the
next section. Then §3 reviews EHMMs, including the representation of FS
as an EHMM. It is shown how the standard regret bound for FS by
Herbster and Warmuth [1998] can be proved using this representation. In
§4 we formally define the freezing and sleeping LL-TBE reference
schemes and present our new algorithms. Then we prove their regret
bounds and state their running times.



2 Preliminaries: Prediction With 
Expert Advice 

In this section we formally introduce the online learning setting of
prediction with expert advice. In this setting prediction proceeds in
rounds. In each round $t$, we first receive advice from each expert
$e \in \mathcal{E}$ in the form of an action $a_t^e \in \mathcal{A}$.
Then we distill our own action $a_t \in \mathcal{A}$ from the expert
advice. Finally, the actual outcome $x_t \in \mathcal{X}$ is observed,
and everybody suffers loss as specified by a fixed loss function
$\ell : \mathcal{A} \times \mathcal{X} \to [0, \infty]$. Thus, the
performance of a sequence of actions $a_{1:T} = a_1, \ldots, a_T$ on
data $x_{1:T} = x_1, \ldots, x_T$ is measured by the cumulative loss
$\ell(a_{1:T}, x_{1:T}) = \sum_{t=1}^{T} \ell(a_t, x_t)$.

We present our results for log(arithmic) loss only, which allows us to
draw on familiar concepts from probability theory, such as conditional
probabilities and hidden Markov models. Their generalisation to
arbitrary mixable losses is straightforward using the methods of Vovk
[1999].



Log Loss For log loss the actions $\mathcal{A}$ are probability mass
(or density) functions on $\mathcal{X}$ and $\ell(p, x) = -\log p(x)$
for any $p \in \mathcal{A}$, where $\log$ denotes the natural
logarithm. Notice that minimising log loss is equivalent to maximising
the predicted probability of outcome $x$. We write $p_t^e$ for the
prediction of expert $e$ at time $t$ and denote the predictions for all
experts jointly by $p_t^{\mathcal{E}}$. Another important property of
the log loss is the chain rule: interpreting any prediction $p_t(x_t)$
as the conditional probability $P(x_t \mid x_{<t})$ of outcome $x_t$
given all past outcomes $x_{<t} = x_1, \ldots, x_{t-1}$, we see that
the cumulative log loss of a sequence of predictions

$$\sum_{t=1}^{T} -\log p_t(x_t) = -\log \prod_{t=1}^{T} P(x_t \mid x_{<t}) = -\log P(x_{1:T}) \qquad (1)$$

equals the negative logarithm of the joint $P$-probability of all data
$x_{1:T}$. Thus any lower bound on $P(x_{1:T})$ directly implies an
upper bound on the cumulative loss of predictions $p_1, \ldots, p_T$ on
data $x_{1:T}$.

Figure 3: Bayesian Network Specification of an EHMM

Segments For $m \le n$, we abbreviate the segment $\{m, \ldots, n\}$ to
$m{:}n$. For any sequence $y_1, y_2, \ldots$ and any segment
$C = m{:}n$ we write $y_C$ for the subsequence $y_m, \ldots, y_n$. For
example, $x_{m:n} = x_m, \ldots, x_n$ and
$p^e_{1:T} = p^e_1, \ldots, p^e_T$. If all segments in a family
$\mathcal{C} = \{C_1, C_2, \ldots\}$ are pairwise disjoint and together
cover $1{:}T$, then we call $\mathcal{C}$ a segmentation of $1{:}T$. We
denote by $(e_C \in \mathcal{E})_{C \in \mathcal{C}}$ the labelling
that assigns expert $e_C$ to segment $C$.

3 Expert Hidden Markov Models 

EHMMs were introduced by Koolen and De Rooij [2008a] as a graphical and
computational language to specify strategies for prediction with expert
advice.
EHMM diagrams directly represent the internal struc- 
ture of the prediction strategy, facilitating the deriv- 
ation of loss bounds. Moreover, there is a standard 
algorithm for sequential prediction, the forward al- 
gorithm, which greatly simplifies derivation of running 
time bounds. 

In this paper, we use EHMMs in two ways. On the 
input side, we use them to represent the learning ex- 
perts whose predictions we want to combine. On the 
output side, we specify our own prediction strategies 
based on expert advice as EHMMs. 

An EHMM $\mathfrak{H}$ is a probability distribution that is
constructed according to the Bayesian network in Figure 3. It is used
to sequentially predict outcomes $X_1, X_2, \ldots$, which take values
in outcome space $\mathcal{X}$, using advice from a set of experts
$\mathcal{E}$. At each time $t$, the distribution of $X_t$ depends on a
hidden state $Q_t$, which determines mixing weights for the experts'
predictions. Formally, the production function $p_\downarrow$
determines the interpretation of a state: it maps any state
$q_t \in \mathcal{Q}$ to a distribution $p_\downarrow^{q_t}$ on the
identity $E_t$ of the expert that should be used to predict $X_t$. Then
given $E_t = e$, the distribution of $X_t$ is expert $e$'s prediction
$p_t^e$. It remains to define the distribution of the hidden states.
The starting state $Q_1$ has initial distribution $p_\circ$, and the
state evolves according to the transition function $p_\to$, which maps
any state $q_t$ to a distribution $p_\to^{q_t}$ on its successor
states.

An EHMM $\mathfrak{H}$ defines a prediction strategy as follows: after
observing $x_{<t}$, predict the next outcome $X_t$ using the marginal
$\mathfrak{H}(X_t \mid x_{<t})$, which is a mixture of the experts'
predictions $p_t^{\mathcal{E}}$.

We present four example EHMMs. The first three ex- 
amples are suitable as input learning experts, which 
might be combined in the sleeping or freezing LL-TBE 
reference scheme. The fourth example represents FS 
as an EHMM, which will later be helpful when we 
compare it to our new generalisations. 

Example 3.1 (Figure 1: Expert that Learns a Drifting Mean). Here we
formally define the EHMM $\mathrm{DM}[\theta]$ from the example in the
introduction. Recall that the base experts predict according to
standard normal distributions with fixed mean $\mu$, which only takes
integer values. Thus

$$p_t^{\mu}(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}(x - \mu)^2}$$

for all $\mu \in \mathcal{E} := \mathbb{N} = \{0, 1, 2, \ldots\}$. In
this EHMM it is sufficient to have a one-to-one correspondence between
hidden states and experts, such that $Q_t = E_t$. This is expressed by
$\mathcal{Q} := \mathcal{E}$ and $p_\downarrow := I$, where $I$ denotes
the identity operator. The definition of $\mathrm{DM}[\theta]$ is
completed by letting the initial distribution $p_\circ$ be a point-mass
on $\mu = 0$, and defining the transition function $p_\to$ as in
Figure 1: for any two states $\mu, \mu' \in \mathcal{Q}$

$$p_\to^{\mu}(\mu') := \begin{cases} \theta & \text{if } \mu' = \mu + 1, \\ 1 - \theta & \text{if } \mu' = \mu, \\ 0 & \text{otherwise.} \end{cases}$$

◊
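To make the definition concrete, here is a minimal sketch of how the drifting-mean expert could be run sequentially with the forward algorithm. It assumes real-valued outcomes and keeps one posterior weight per reachable mean; the function names are hypothetical, and the code does not guard against numerical underflow when an outcome is far from every reachable mean.

```python
import math


def normal_pdf(x, mu):
    """Density of a normal distribution with mean mu and unit variance."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


def dm_cumulative_log_loss(xs, theta):
    """Cumulative log loss of the drifting-mean expert with drift probability theta."""
    weights = [1.0]  # weights[mu] = P(mu_t = mu | x_{<t}); point mass on mu = 0
    total = 0.0
    for x in xs:
        joint = [w * normal_pdf(x, mu) for mu, w in enumerate(weights)]
        prediction = sum(joint)            # predictive density of the observed x
        total += -math.log(prediction)
        posterior = [j / prediction for j in joint]
        # Transition: stay at mu with probability 1 - theta, move to mu + 1 with theta.
        evolved = [0.0] * (len(posterior) + 1)
        for mu, w in enumerate(posterior):
            evolved[mu] += (1 - theta) * w
            evolved[mu + 1] += theta * w
        weights = evolved
    return total
```

With `theta = 1` the mean increases by one every round, so on the outcomes 0, 1, 2 every prediction is centred exactly on the outcome; with `theta = 0` the expert reduces to the static expert with mean 0.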



Example 3.2 (Bayes on base experts). Consider the Bayesian mixture
(also known as the exponentially weighted average predictor) of base
experts $\mathcal{E}$ with prior $w$. We identify this prediction
strategy with the following EHMM $\mathrm{B}[w]$, which makes the same
predictions. As in the previous example, let
$\mathcal{Q} := \mathcal{E}$ and $p_\downarrow := I$, so that
$Q_t = E_t$. This time, however, let $p_\circ := w$ and $p_\to := I$.
Despite its deceptive simplicity, this EHMM learns: its marginal
distribution of $X_{t+1}$ given previous outcomes $x_{1:t}$ is a
mixture of the base experts' predictions according to the Bayesian
posterior. ◊
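A small sketch of the Bayesian mixture, assuming that for every round we are handed the probability each expert assigned to the outcome that actually occurred. The function name and the input layout are assumptions made for illustration. The example also checks the chain rule (1) numerically: the cumulative log loss equals minus the logarithm of the prior-weighted joint probability.

```python
import math


def bayes_mixture_loss(prior, expert_probs):
    """Cumulative log loss and final posterior of the Bayesian mixture.

    prior[e] is w(e); expert_probs[t][e] is the probability that expert e
    assigned to the outcome observed in round t + 1.
    """
    posterior = list(prior)
    total = 0.0
    for probs in expert_probs:
        prediction = sum(w * p for w, p in zip(posterior, probs))
        total += -math.log(prediction)
        posterior = [w * p / prediction for w, p in zip(posterior, probs)]
    return total, posterior


prior = [0.5, 0.5]
expert_probs = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.6]]
loss, posterior = bayes_mixture_loss(prior, expert_probs)

# Joint probability of the data under the mixture: sum_e w(e) * prod_t p_t^e(x_t).
joint = sum(w * math.prod(col) for w, col in zip(prior, zip(*expert_probs)))
```

Since the joint probability is at least the contribution of any single expert, the mixture loses at most $-\log w(e)$ more than expert $e$, which is the bound used for each segment in the proofs below.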



Example 3.3 (Bayes on EHMMs). Let
$\mathcal{H} = \{\mathfrak{H}^1, \ldots, \mathfrak{H}^n\}$ be EHMMs
with base experts $\mathcal{E}^1, \ldots, \mathcal{E}^n$, and let $w$
be a prior on $\mathcal{H}$. Then, instead of treating
$\mathfrak{H}^1, \ldots, \mathfrak{H}^n$ as black box predictors as in
the previous example, their Bayesian mixture can also be expressed as a
single EHMM $\mathrm{B}[w, \mathcal{H}]$ on the union of their base
experts $\mathcal{E} := \bigcup_{i=1}^{n} \mathcal{E}^i$; assume
without loss of generality that
$\mathfrak{H}^1, \ldots, \mathfrak{H}^n$ have disjoint state spaces
$\mathcal{Q}^1, \ldots, \mathcal{Q}^n$ and let
$\mathcal{Q} := \bigcup_{i=1}^{n} \mathcal{Q}^i$. For any state
$q \in \mathcal{Q}^i$, let $p_\downarrow^{q}$ equal
$p_\downarrow^{i,q}$, where $p_\downarrow^{i}$ is the production
function of $\mathfrak{H}^i$, so that all states keep their original
interpretation. In addition let $p_\circ(q) := w(i)\, p_\circ^{i}(q)$,
where $p_\circ^{i}$ denotes the initial distribution of
$\mathfrak{H}^i$. Finally, let $p_\to^{q}(q')$ equal
$p_\to^{i,q}(q')$, the transition probability from $q$ to $q'$ for
$\mathfrak{H}^i$, if $q, q' \in \mathcal{Q}^i$, and let
$p_\to^{q}(q') := 0$ otherwise. Again, this EHMM learns which of the
EHMMs in $\mathcal{H}$ is the best predictor. ◊

Example 3.4 (Fixed share). The fixed-share algorithm takes a parameter
$\alpha$, called the switching rate. Fixed share with prior
distribution $w$ on experts $\mathcal{E}$ and switching rate $\alpha$
can be represented as an EHMM $\mathrm{FS}[\alpha, w]$ as follows. As
in the Bayesian mixture on base experts, let
$\mathcal{Q} := \mathcal{E}$ and $p_\downarrow := I$, so that
$Q_t = E_t$, and let $p_\circ := w$. Instead of the identity operator,
however, use the transition function

$$p_\to := (1 - \alpha) I + \alpha w \mathbf{1}^{\mathsf{T}},$$

where $\mathbf{1}^{\mathsf{T}}$ denotes the operator that sums the
probability masses of all the hidden states. This transition function
may be interpreted as follows: behave like the Bayesian mixture with
probability $1 - \alpha$, but with probability $\alpha$ take all the
probability mass and redistribute it according to the prior $w$.
Observe that for any probability distribution $\lambda$ on states
$\mathcal{Q}$, we can compute
$p_\to \lambda = (1 - \alpha) \lambda + \alpha w$ in constant time per
state.

We also note that in [Herbster and Warmuth, 1998] the prior $w$ is
always taken to be the uniform distribution, which gives the best
worst-case regret bound. ◊
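The constant-time-per-state update can be sketched as follows. This is an illustrative rendering of the forward algorithm on $\mathrm{FS}[\alpha, w]$ under the assumption that the experts' probabilities for the observed outcomes are given as input; the function name and input layout are hypothetical.

```python
import math


def fixed_share_loss(prior, alpha, expert_probs):
    """Cumulative log loss of fixed share with prior w and switching rate alpha.

    expert_probs[t][e] is the probability expert e assigned to the outcome
    observed in round t + 1.
    """
    lam = list(prior)  # lam[e] = P(Q_t = e | x_{<t})
    total = 0.0
    for probs in expert_probs:
        prediction = sum(l * p for l, p in zip(lam, probs))
        total += -math.log(prediction)
        posterior = [l * p / prediction for l, p in zip(lam, probs)]
        # Transition p_-> lam = (1 - alpha) lam + alpha w: O(1) work per state.
        lam = [(1 - alpha) * q + alpha * w for q, w in zip(posterior, prior)]
    return total


probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
loss_sharing = fixed_share_loss([0.5, 0.5], 1 / 3, probs)
loss_bayes = fixed_share_loss([0.5, 0.5], 0.0, probs)  # alpha = 0 is the Bayesian mixture
```

On this toy input the best expert changes halfway, and sharing weight lets the algorithm recover faster than the plain Bayesian mixture, which is the case `alpha = 0`.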

3.1 Standard Fixed Share Loss Bound 

To demonstrate the graphical derivation of loss bounds for EHMMs we now
prove a regret bound for FS using its representation as an EHMM. The
general technique is to give lower bounds on the transition function
and the initial distribution. For simplicity the bound we show is
slightly weaker than the standard regret bound [Herbster and Warmuth,
1998, Corollary 1]. (One could get the exact same bound by taking into
account the remark in footnote 3 of [Koolen and De Rooij, 2008a], but
this unnecessarily complicates the proof.)

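Theorem 1. Fix a prior $w$ on experts $\mathcal{E}$ and a switching
rate $\alpha$. Then for any data $x_{1:T}$, expert predictions
$p^{\mathcal{E}}_{1:T}$, reference segmentation $\mathcal{C}$ and
assignment of experts to segments
$(e_C \in \mathcal{E})_{C \in \mathcal{C}}$

$$\ell(\mathrm{FS}[\alpha, w], x_{1:T}) \le \underbrace{\sum_{C \in \mathcal{C}} \ell(e_C, x_C)}_{\text{S-TBE ref. scheme}} + \underbrace{(T-1)\, H(\alpha^*, \alpha)}_{\text{Switching}} + \underbrace{\sum_{C \in \mathcal{C}} -\log w(e_C)}_{\text{Expert selection}},$$

where $H(\alpha, \beta) = -\alpha \log \beta - (1 - \alpha) \log(1 - \beta)$
and $\alpha^* = \frac{|\mathcal{C}| - 1}{T - 1}$.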

Note that if $w$ is the uniform distribution then
$-\log w(e_C) = \log |\mathcal{E}|$ for all $e_C$. Then the difference
with the standard bound in Herbster and Warmuth [1998] is
$(|\mathcal{C}| - 1)(\log |\mathcal{E}| - \log(|\mathcal{E}| - 1))$,
which is negligible.
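As an illustration of how the three terms of Theorem 1 combine, the following sketch evaluates the right-hand side for given inputs. It assumes $T \ge 2$, $0 < \alpha < 1$ and, for the comparison with the standard bound, at least two experts; all function names are hypothetical.

```python
import math


def cross_entropy_term(a, b):
    """H(a, b) = -a log b - (1 - a) log(1 - b), with natural logarithms."""
    return -a * math.log(b) - (1 - a) * math.log(1 - b)


def theorem1_bound(reference_loss, T, num_segments, alpha, selected_prior_weights):
    """Reference loss + switching term + expert selection term.

    selected_prior_weights holds w(e_C) for every segment C of the reference.
    """
    alpha_star = (num_segments - 1) / (T - 1)
    switching = (T - 1) * cross_entropy_term(alpha_star, alpha)
    selection = sum(-math.log(w) for w in selected_prior_weights)
    return reference_loss + switching + selection


def gap_to_standard_bound(num_segments, num_experts):
    """Difference with the standard bound under a uniform prior."""
    return (num_segments - 1) * (math.log(num_experts) - math.log(num_experts - 1))
```

The switching term equals $-\log\big(\alpha^{|\mathcal{C}|-1}(1-\alpha)^{T-|\mathcal{C}|}\big)$, the cost of encoding where the switches occur, and for a fixed reference it is smallest when $\alpha = \alpha^*$.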

Proof. Recall that $\mathrm{FS} = \mathrm{FS}[\alpha, w]$ has
transition function
$p_\to = (1 - \alpha) I + \alpha w \mathbf{1}^{\mathsf{T}}$. Therefore
for any reference segmentation $\mathcal{C}$ the joint probability
$\mathrm{FS}(x_{1:T})$ of any data sequence $x_{1:T}$ can be bounded
from below by replacing transitions in FS between segments by
$\alpha w \mathbf{1}^{\mathsf{T}}$, and those within the same segment
by $(1 - \alpha) I$. The EHMM then degenerates into a sequence of
independent Bayesian mixture EHMMs $\mathrm{B}[w]$ (see Example 3.2),
one for each segment. Therefore

$$\mathrm{FS}(x_{1:T}) \ge \alpha^{|\mathcal{C}| - 1} (1 - \alpha)^{T - |\mathcal{C}|} \prod_{C \in \mathcal{C}} \mathrm{B}[w](x_C).$$

Similarly we can lower-bound the initial distribution of
$\mathrm{B}[w]$ by a function that assigns weight $w(e_C)$ to the
expert $e_C$ selected for $C$ in the reference segmentation and is $0$
otherwise. It follows that
$\mathrm{B}[w](x_C) = \sum_{e} w(e)\, p^{e}_{C}(x_C) \ge w(e_C)\, p^{e_C}_{C}(x_C)$,
where $p^{e}_{C}(x_C)$ denotes the joint probability of outcomes $x_C$
according to the predictions of expert $e$. Hence by (1) we can
conclude that

$$\begin{aligned} \ell(\mathrm{FS}, x_{1:T}) &= -\log \mathrm{FS}(x_{1:T}) \\ &\le -\log \alpha^{|\mathcal{C}| - 1} (1 - \alpha)^{T - |\mathcal{C}|} + \sum_{C \in \mathcal{C}} \left( -\log p^{e_C}_{C}(x_C) - \log w(e_C) \right) \\ &= (T - 1)\, H(\alpha^*, \alpha) + \sum_{C \in \mathcal{C}} \ell(e_C, x_C) + \sum_{C \in \mathcal{C}} -\log w(e_C), \end{aligned}$$

which completes the proof. □

4 Fixed Share for Learning Experts 

In this section we define the freezing and sleeping LL-TBE reference
schemes for learning experts. Then, for each scheme, we provide our
prediction strategy, $\mathrm{FS}^{\mathrm{fr}}$ and
$\mathrm{FS}^{\mathrm{sl}}$ respectively, and we prove that it achieves
as small regret as FS.

Figure 4: Freezing and Sleeping EHMM on an Example Segment



4.1 LL-TBE and the Loss of an EHMM on a 
Segment 

In order to state the loss of the freezing and sleeping LL-TBE
reference schemes, we first define the loss of a single learning expert
on a single segment. Then we define the loss of a whole segmentation.

Let $\mathfrak{H}$ be the EHMM for a learning expert with arbitrary
base experts $\mathcal{E}$. Then the freezing and sleeping probability
distributions $\mathfrak{H}^{\mathrm{fr}}_{i:j}$ and
$\mathfrak{H}^{\mathrm{sl}}_{i:j}$ on segment $x_{i:j}$ are specified
by the Bayesian networks of Figure 4. For freezing, the state at time
$i$ is simply initialised according to $\mathfrak{H}$'s initial
distribution $p_\circ$. For sleeping, we forward the initial
distribution to time $i$ by repeatedly applying the transition function
$p_\to$. Thus, the cumulative freezing and sleeping losses of
$\mathfrak{H}$ on segment $x_{i:j}$ are given by
$\ell(\mathfrak{H}^{\mathrm{fr}}_{i:j}, x_{i:j}) := -\log \mathfrak{H}^{\mathrm{fr}}_{i:j}(x_{i:j})$
and
$\ell(\mathfrak{H}^{\mathrm{sl}}_{i:j}, x_{i:j}) := -\log \mathfrak{H}^{\mathrm{sl}}_{i:j}(x_{i:j})$.
Note that we treat the base experts $\mathcal{E}$ as black boxes, so
they may learn from the whole data.
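The difference between the two segment losses can be sketched for a finite-state EHMM as follows. The sketch makes simplifying assumptions that are not part of the definition: the state space is finite, the transition function is given as a row-stochastic matrix `trans[q][r]`, and the production step is already folded into a table `likelihoods[t - 1][q]` holding the probability that state `q` assigns to the outcome of round `t`. All names are hypothetical.

```python
import math


def push_forward(dist, trans):
    """Apply the transition function once: new[r] = sum_q dist[q] * trans[q][r]."""
    n = len(dist)
    return [sum(dist[q] * trans[q][r] for q in range(n)) for r in range(n)]


def segment_loss(init, trans, likelihoods, i, j, mode):
    """Freezing or sleeping log loss of a finite-state EHMM on segment i:j (1-based)."""
    if mode not in ("freezing", "sleeping"):
        raise ValueError(mode)
    lam = list(init)
    if mode == "sleeping":
        # Forward the initial distribution to time i without observing anything.
        for _ in range(i - 1):
            lam = push_forward(lam, trans)
    log_prob = 0.0
    for t in range(i, j + 1):
        joint = [w * p for w, p in zip(lam, likelihoods[t - 1])]
        marginal = sum(joint)
        log_prob += math.log(marginal)
        lam = push_forward([v / marginal for v in joint], trans)
    return -log_prob


init = [1.0, 0.0]
trans = [[0.0, 1.0], [0.0, 1.0]]  # state 0 always moves to the absorbing state 1
likelihoods = [[0.5, 0.5], [0.2, 0.8], [0.3, 0.7]]
freezing = segment_loss(init, trans, likelihoods, 2, 3, "freezing")
sleeping = segment_loss(init, trans, likelihoods, 2, 3, "sleeping")
```

On the segment 2:3 the frozen copy starts in state 0, whereas the sleeping copy has already moved to state 1 by time 2, so the two losses differ even though both copies see exactly the same outcomes.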



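Definition 1 (LL-TBE reference loss). Fix data $x_{1:T}$ and a set of
EHMMs $\mathcal{H}$. Let $\mathcal{C}$ be a segmentation of $1{:}T$ and
let $(\mathfrak{H}_C \in \mathcal{H})_{C \in \mathcal{C}}$ be an
assignment of experts to segments. Then the losses of the freezing and
sleeping LL-TBE reference schemes are
$\sum_{C \in \mathcal{C}} \ell\big((\mathfrak{H}_C)^{\mathrm{fr}}_{C}, x_C\big)$
and
$\sum_{C \in \mathcal{C}} \ell\big((\mathfrak{H}_C)^{\mathrm{sl}}_{C}, x_C\big)$.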



Note that selecting a learning expert on consecutive 
segments differs from selecting that expert on their 
union, since experts are reset between segments. 
(a) EHMM $\mathrm{FS}^{\mathrm{fr}}[\alpha, \mathfrak{B}]$. Any switch
reverts to $p_\circ^{\mathfrak{B}}$, the initial distribution of
$\mathfrak{B}$.

(b) EHMM $\mathrm{FS}^{\mathrm{sl}}[\alpha, \mathfrak{B}]$. The switch
between time $t$ and $t+1$ reverts to
$(p_\to^{\mathfrak{B}})^t p_\circ^{\mathfrak{B}}$, the $t$-th evolution
of the initial distribution of $\mathfrak{B}$.

Figure 5: EHMMs for Tracking the EHMM $\mathfrak{B}$ with Switching
Rate $\alpha$



4.2 Main Result: Construction of the 
Freezing and Sleeping EHMMs 

We now present the construction of EHMMs for the freezing and sleeping
algorithms $\mathrm{FS}^{\mathrm{fr}}$ and $\mathrm{FS}^{\mathrm{sl}}$.
Let $\mathcal{H}$ be a set of learning experts, each expert
$\mathfrak{H} \in \mathcal{H}$ presented as an EHMM on basic experts
$\mathcal{E}$. Let $w$ be a prior on $\mathcal{H}$, and let $\alpha$ be
a switching rate. We proceed in two steps. First construct the Bayesian
EHMM $\mathfrak{B} = \mathrm{B}[w, \mathcal{H}]$ as in Example 3.3.
Recall that $\mathfrak{B}$ learns which of the EHMMs in $\mathcal{H}$
predicts best. Second, construct the freezing EHMM
$\mathrm{FS}^{\mathrm{fr}}[\alpha, \mathfrak{B}]$ or the sleeping
EHMM$^1$ $\mathrm{FS}^{\mathrm{sl}}[\alpha, \mathfrak{B}]$ as shown in
Figure 5. Note how, on a switch, both EHMMs reset the entire state of
$\mathfrak{B}$, which includes the states of experts in $\mathcal{H}$.
In contrast, FS only resets its weighting on $\mathcal{H}$ but does not
touch the internal state of the experts in $\mathcal{H}$.



$^1$ Strictly speaking, the Bayesian network in Figure 5b is not an
EHMM, since the transition function depends on the time. Nevertheless,
this time-dependency can be removed without any computational overhead
using a process called unfolding, see [Koolen and De Rooij, 2008b].



4.3 Prediction Algorithms 

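To sequentially predict data using our prediction strategies
$\mathrm{FS}^{\mathrm{fr}}$ and $\mathrm{FS}^{\mathrm{sl}}$, one needs
to run the forward algorithm on their respective EHMMs. An explicit
rendering of this process is included in Algorithm 1.

Algorithm 1: Explicit Forward Algorithm on $\mathrm{FS}^{v}$ for both
Freezing and Sleeping ($v \in \{\mathrm{fr}, \mathrm{sl}\}$)

  Construct $\mathfrak{B} = \mathrm{B}[w, \mathcal{H}]$ with $\mathcal{Q}$, $p_\circ$, $p_\downarrow$ and $p_\to$ as in Example 3.3
  Initialisation: $\lambda \leftarrow p_\circ$
  for $t = 1, 2, \ldots$ do    ▷ Invariant: $\lambda(q) = \mathrm{FS}^{v}[\alpha, \mathfrak{B}](Q_t = q \mid x_{<t})$
      Receive expert advice $p_t^{\mathcal{E}}$.
      Predict $X_t$ using $\lambda(X_t) = \sum_{e \in \mathcal{E},\, q \in \mathcal{Q}} \lambda(q)\, p_\downarrow^{q}(e)\, p_t^{e}(X_t)$
      Observe $X_t = x_t$. Suffer loss $\ell(\lambda(X_t), x_t)$.
      Loss update: $\lambda(q) \leftarrow \lambda(q, x_t) / \lambda(x_t)$, where $\lambda(q, x_t) = \lambda(q) \sum_{e \in \mathcal{E}} p_\downarrow^{q}(e)\, p_t^{e}(x_t)$
      State evolution:
          $\lambda \leftarrow (1 - \alpha)\, p_\to \lambda + \alpha\, p_\circ$    (Freezing)
          $\lambda \leftarrow (1 - \alpha)\, p_\to \lambda + \alpha\, (p_\to)^{t} p_\circ$    (Sleeping)
  end for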

At any time $t$, the algorithm for $\mathrm{FS}^{\mathrm{sl}}$ only
maintains non-zero weights on hidden states of the input learning
experts that are reachable in exactly $t$ steps from the starting
states, just like the original FS algorithm. It therefore has the same
running time. The algorithm for $\mathrm{FS}^{\mathrm{fr}}$, however,
has to keep track of all states reachable in at most $t$ steps.
Consequently, in the worst case (over input EHMMs) it may be as slow as
restarting expert copies (see §1.4). But if the input EHMMs have a
finite number of hidden states, then its running time is of the same
order as that of FS. And if the states (of the input EHMMs) that are
reachable in exactly $t$ steps are the same ones as the states
reachable in at most $t$ steps, which holds e.g. for the drifting-mean
expert $\mathrm{DM}[\theta]$ from the introduction, then we also
recover the efficiency of FS.
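The state evolution step of both variants can be sketched for a finite-state $\mathfrak{B}$ as follows. As a simplifying assumption the caller supplies `likelihoods[t][q]`, the probability that state `q` assigns to the observed outcome after mixing its experts, and the transition function is a row-stochastic matrix; the function names are hypothetical. For sleeping, the restart distribution $(p_\to)^t p_\circ$ is maintained incrementally, so each round costs the same as in the freezing case.

```python
import math


def push_forward(dist, trans):
    """Apply the transition function once: new[r] = sum_q dist[q] * trans[q][r]."""
    n = len(dist)
    return [sum(dist[q] * trans[q][r] for q in range(n)) for r in range(n)]


def fixed_share_ehmm_loss(init, trans, likelihoods, alpha, mode):
    """Cumulative log loss of the freezing or sleeping forward algorithm."""
    if mode not in ("freezing", "sleeping"):
        raise ValueError(mode)
    lam = list(init)      # lam[q] = P(Q_t = q | x_{<t})
    restart = list(init)  # the distribution that a switch reverts to
    total = 0.0
    for like in likelihoods:
        joint = [w * p for w, p in zip(lam, like)]
        marginal = sum(joint)
        total += -math.log(marginal)
        posterior = [v / marginal for v in joint]
        if mode == "sleeping":
            restart = push_forward(restart, trans)  # after round t: (p_->)^t p_o
        evolved = push_forward(posterior, trans)
        lam = [(1 - alpha) * e + alpha * r for e, r in zip(evolved, restart)]
    return total


init = [1.0, 0.0]
trans = [[0.0, 1.0], [0.0, 1.0]]
likelihoods = [[0.5, 0.5], [0.2, 0.8]]
loss_fr = fixed_share_ehmm_loss(init, trans, likelihoods, 0.5, "freezing")
loss_sl = fixed_share_ehmm_loss(init, trans, likelihoods, 0.5, "sleeping")
```

In this toy chain a freezing switch puts weight back on state 0, which is no longer reachable in exactly $t$ steps, while a sleeping switch reverts to state 1; this mirrors the remark above on which states each algorithm has to track.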

4.4 Loss Bound 

Theorem 1 bounds the regret of FS compared to the S-TBE reference
scheme by a "switching" and an "expert selection" term. We bound the
regret of $\mathrm{FS}^{\mathrm{fr}}$ and $\mathrm{FS}^{\mathrm{sl}}$
compared to their LL-TBE reference scheme by the same two terms.

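Switching between Hidden Markov Models using Fixed Share

Theorem 2. Fix a set of EHMMs $\mathcal{H}$ on basic experts
$\mathcal{E}$, a prior $w$ on $\mathcal{H}$, a switching rate $\alpha$
and $v \in \{\mathrm{fr}, \mathrm{sl}\}$. Let
$\mathfrak{B} = \mathrm{B}[w, \mathcal{H}]$. Then for any data
$x_{1:T}$, expert predictions $p^{\mathcal{E}}_{1:T}$, reference
segmentation $\mathcal{C}$ and assignment of experts to segments
$(\mathfrak{H}_C \in \mathcal{H})_{C \in \mathcal{C}}$

$$\ell(\mathrm{FS}^{v}[\alpha, \mathfrak{B}], x_{1:T}) \le \underbrace{\sum_{C \in \mathcal{C}} \ell\big((\mathfrak{H}_C)^{v}_{C}, x_C\big)}_{\text{LL-TBE ref. scheme}} + \underbrace{(T-1)\, H(\alpha^*, \alpha)}_{\text{Switching}} + \underbrace{\sum_{C \in \mathcal{C}} -\log w(\mathfrak{H}_C)}_{\text{Expert selection}},$$

where $H(\alpha^*, \alpha)$ and
$\alpha^* = \frac{|\mathcal{C}| - 1}{T - 1}$ are as in Theorem 1.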



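Proof. The proof proceeds like that of Theorem 1. Lower-bounding
transitions between segments by
$\alpha\, p_\circ^{\mathfrak{B}} \mathbf{1}^{\mathsf{T}}$ (freezing) or
$\alpha\, (p_\to^{\mathfrak{B}})^{t} p_\circ^{\mathfrak{B}} \mathbf{1}^{\mathsf{T}}$
(sleeping), and transitions within each segment by
$(1 - \alpha)\, p_\to^{\mathfrak{B}}$, we get

$$\mathrm{FS}^{v}[\alpha, \mathfrak{B}](x_{1:T}) \ge \alpha^{|\mathcal{C}| - 1} (1 - \alpha)^{T - |\mathcal{C}|} \prod_{C \in \mathcal{C}} \mathfrak{B}^{v}_{C}(x_C) \qquad (2)$$

where $\mathfrak{B}^{v}_{C}$ denotes the result of freezing or sleeping
$\mathfrak{B}$ on segment $C \in \mathcal{C}$ as in Figure 4. Observe
that freezing and sleeping distribute over taking the Bayesian mixture:
$\mathfrak{B}^{v}_{C} = \mathrm{B}[w, \mathcal{H}^{v}_{C}]$, where
$\mathcal{H}^{v}_{C} := \{\mathfrak{H}^{v}_{C} \mid \mathfrak{H} \in \mathcal{H}\}$.
As
$\mathrm{B}[w, \mathcal{H}^{v}_{C}](x_C) = \sum_{\mathfrak{H} \in \mathcal{H}} w(\mathfrak{H})\, \mathfrak{H}^{v}_{C}(x_C) \ge w(\mathfrak{H}_C)\, (\mathfrak{H}_C)^{v}_{C}(x_C)$,
the theorem follows from (1), like in the proof of Theorem 1. □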

5 Conclusion 

We revisited the tracking the best expert reference 
scheme (TBE), which asks for a strategy for predic- 
tion with expert advice that suffers small additional 
loss compared to the best expert per segment. This 
goal is natural when the characteristics of the data, 
and hence the best expert, are different between seg- 
ments. 

For learning experts, the standard interpretation of ex- 
perts as black boxes implies training the experts on all 
data. We proposed a variation, adapted to learning 
experts, in which experts are only trained on the seg- 
ment on which they are evaluated. Our scheme is able 
to exploit patterns in the data per segment, leading to 
smaller loss. 

Although in general extending the standard fixed- 
share algorithm to our setting will slow it down by 
a factor of T on T outcomes, we showed that no 
such slowdown is necessary if the learning experts 
can be represented as expert hidden Markov models 
(EHMMs). We proved the loss bounds one would ex- 
pect based on the loss bound for the original fixed- 
share algorithm. 

5.1 Discussion and Future Work 

Learning the Switching Rate Like fixed share, our algorithms depend on
a switching rate parameter $\alpha$, which has to be fixed. Instead,
one may want to tune $\alpha$ automatically based on the data. For FS
this can be done efficiently (see [De Rooij and Van Erven, 2009] and
references therein). The same methods transfer directly to
$\mathrm{FS}^{\mathrm{fr}}$ and $\mathrm{FS}^{\mathrm{sl}}$.



S-TBE vs LL-TBE We have discussed experts that 
learn only on their assigned segment. Perhaps surpris- 
ingly, this does not always increase performance. For 
example, we may have homogeneous data and experts 
that learn its global pattern at different rates. In such 
cases we clearly want to train each expert on all obser- 
vations and, by switching at the right times, select the 
expert that has learned most until then. This scenario is analysed by
Van Erven et al. [2008], where experts are parameter estimators for a
series of statistical models of increasing complexity.

Partitions instead of Segmentations Rather 
than split the data into segments as in the TBE refer- 
ence scheme, one may wish to partition it arbitrarily 
into cells such that observations in the same cell need 
not be consecutive. Like fixed share, the corresponding algorithm
[Bousquet and Warmuth, 2002] can be generalised to the LL-TBE setting
without increasing its running time. In this case naively introducing
copies of the experts for all possible partitions is infeasible: it
would slow down the algorithm by an exponential factor $2^T$ on $T$
outcomes.